Topic modeling of the fMRI literature¶
In this analysis we take results obtained by applying BERTopic to text from PubMed abstracts that match the following query:
'("fMRI" OR "functional MRI" OR "functional magnetic resonance imaging") AND (brain OR neural OR neuroscience OR neurological OR psychiatric OR psychology) AND %d[DP]' % year
We focus here on the years 2002 through 2022; earlier years were excluded because they yielded relatively few matching PubMed abstracts.
First we load the relevant functions:
from fmritopics.fit_dynamic_topic_model import load_data, get_embeddings
from fmritopics.analyze_dynamic_topics import (
    load_model,
    get_topics_over_time,
    get_hierarchical_topics,
    plot_hierarchical_topics,
    get_top_topics_over_time,
    plot_top_topics,
    get_slopes,
)
from collections import defaultdict
import pandas as pd
import umap
import umap.plot
import numpy as np
import matplotlib.pyplot as plt
Now we need to select the topic modeling parameters. Two main parameters determine the number of topics in BERTopic: the number of neighbors used in the UMAP dimensionality reduction step, and the minimum cluster size for the HDBSCAN clustering step. We ran models with various values of these parameters; the overall conclusions are robust across settings, although specific topics and their changes over time did differ for some values.
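To build intuition for how the minimum cluster size controls topic granularity, the toy sketch below applies the same thresholding idea: document groups smaller than the threshold are treated as noise, so larger thresholds yield fewer, broader topics. This is a simplified stand-in, not HDBSCAN itself.

```python
from collections import Counter

def count_topics(labels, min_cluster_size):
    """Count clusters that meet the size threshold; smaller ones count as noise.
    A toy stand-in for HDBSCAN's min_cluster_size behavior, for intuition only."""
    sizes = Counter(labels)
    return sum(1 for size in sizes.values() if size >= min_cluster_size)

# Hypothetical cluster assignments for 12 documents:
labels = ["a", "a", "a", "a", "b", "b", "b", "c", "c", "d", "d", "d"]
print(count_topics(labels, min_cluster_size=2))  # 4: all clusters survive
print(count_topics(labels, min_cluster_size=3))  # 3: cluster "c" becomes noise
print(count_topics(labels, min_cluster_size=4))  # 1: only cluster "a" survives
```

Raising the threshold collapses the topic count in the same qualitative way that a larger `min_cluster_size` produces fewer, coarser BERTopic topics.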
min_cluster_size = 250
n_neighbors = 25
datadir = 'data'
modeldir = 'models'
sentences, years = load_data(datadir)
topic_model, embedding_model, model_name = load_model(min_cluster_size, n_neighbors, modeldir)
topics_over_time = get_topics_over_time(sentences, years, topic_model)
hierarchical_topics, tree = get_hierarchical_topics(topic_model, sentences, viz=True)
File data/bigrammed_cleaned_abstracts_1991.pkl does not exist
Loaded model from models/model-bertopic_minclust-250_nneighbors-25_gpt4/model-bertopic_minclust-250_nneighbors-25_gpt4
getting topics over time
32it [00:09, 3.36it/s]
100%|██████████████████████████████████████████| 52/52 [00:00<00:00, 210.05it/s]
We can visualize the hierarchical relationships between the different topics. To label each of the topics, we use GPT-4 to generate labels based on the text in the documents associated with each topic.
Note that long labels are truncated in the figure and there is no easy way to fix this, but the meaning of most topics remains clear.
fig = topic_model.visualize_hierarchy()
fig
We can also visualize the multidimensional space of relationships between topics by embedding it in two dimensions using UMAP dimensionality reduction. The following cell creates a visualization of documents and topics; individual topics are easiest to see by moving the slider to the right to show fewer topics. We will examine this kind of plot in more detail later in the notebook.
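To make the shape of this operation concrete: dimensionality reduction maps each high-dimensional document embedding to a 2-D point that can be placed on a scatter plot. The sketch below uses a random linear projection purely to show the input and output shapes; UMAP itself is a nonlinear method that additionally preserves local neighborhood structure, which a random projection does not.

```python
import random

def project_2d(embeddings, seed=0):
    """Map n-dimensional vectors to 2-D via a random linear projection.
    A stand-in for UMAP used only to illustrate shapes, not its behavior."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    axes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    return [
        tuple(sum(v * a for v, a in zip(vec, axis)) for axis in axes)
        for vec in embeddings
    ]

# Three hypothetical 5-dimensional embeddings -> three 2-D points.
points = project_2d([[0.1] * 5, [0.5] * 5, [0.9] * 5])
print(len(points), len(points[0]))  # 3 2
```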
embeddings, embedding_model = get_embeddings(sentences)
fig, reduced_embeddings = plot_hierarchical_topics(
topic_model, embeddings, sentences, hierarchical_topics,
min_cluster_size, n_neighbors)
fig
using existing embeddings from data/embeddings.pkl